Abstract

Microreboots restart fine-grained components of soft- 
ware systems "with a clean slate," and only take a frac- 
tion of the time needed for full system reboot. Microreboots 
provide an application-generic recovery technique for In- 
ternet services, which can be supported entirely in middle- 
ware and requires no changes to the applications or any a 
priori knowledge of application semantics. 

This paper investigates the effect of microreboots on end- 
users of an eBay-like online auction application; we find 
that microreboots are nearly as effective as full reboots, but 
are significantly less disruptive in terms of downtime and 
lost work. In our experiments, microreboots reduced the 
number of failed user requests by 65% and the perceived 
downtime by 78% compared to a server process restart. We 
also show how to replace user-visible transient failures with 
transparent call-retry, at the cost of a slight increase in end- 
user-visible latency during recovery. Due to their low cost, 
microreboots can be used aggressively, even when their ne- 
cessity is less than certain, hence adding to the reduced re- 
covery time a reduction in the fault detection time, which 
further improves availability. 



1 Introduction 

Transient faults account for a large fraction of failures 
in today's Internet systems and production software in gen- 
eral 1 37 2 1; even mainframe-class operating systems are not 
immune to such transients [41]. Running out of memory
or file descriptors, Heisenbug-triggering load spikes,
deadlocks, performance degradation due to unexplained
interactions between subsystems, etc. are just a few
examples of what Internet service operators face on a
regular basis [30][15].

Reboots have been shown to be an effective way to cure 
many such transients, even in critical software 1111321 . Full- 
system reboots, however, can be expensive [22], both in
terms of downtime and amount of disruption; to mitigate 
this, we introduce the concept of a microreboot. 



In this paper, we demonstrate that microreboots can be 
used to improve the availability of applications hosted on 
a rich middleware platform. A microreboot is a restart of 
a subset of fine-grained (smaller than a process) software 
components in a running system. In this paper, the system in 
question can be any Java 2 Enterprise Edition (J2EE) appli- 
cation running on an open-source J2EE application server 
that we have augmented with fault injection, instrumenta- 
tion, and the ability to microreboot individual beans (appli- 
cation components) as well as functional subsystems such 
as the Web server tier and Java Servlet Pages. 

Microrebooting reflects the emphasis on improving
availability by lowering mean time to recover (MTTR);
availability is commonly thought of as
MTTF/(MTTF + MTTR). With respect to interactive services,
lowering MTTR not only improves the user experience of the
service and the users' perception of service availability
[46, 18], but also serves as a leverage point for applying
aggressive statistical-anomaly-based failure detection
[31][14]. In our case, we also demonstrate that
microreboots can reduce the number of users impacted by a
particular transient failure and the amount of work they
lose.
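
As a rough illustration of the MTTF/(MTTF + MTTR) relationship, the following sketch compares two recovery times for the same failure rate. The function name and all numeric values are assumptions made for this example; they are not measurements from our experiments.

```python
def availability(mttf_s: float, mttr_s: float) -> float:
    """Steady-state availability MTTF / (MTTF + MTTR); both in seconds."""
    if mttf_s <= 0 or mttr_s < 0:
        raise ValueError("MTTF must be positive and MTTR non-negative")
    return mttf_s / (mttf_s + mttr_s)


# Hypothetical values, chosen only for illustration.
MTTF_S = 24 * 60 * 60   # assume one transient failure per day
SLOW_MTTR_S = 100.0     # assumed recovery time for a coarse-grained reboot
FAST_MTTR_S = 25.0      # assumed recovery time for a finer-grained reboot

slow = availability(MTTF_S, SLOW_MTTR_S)
fast = availability(MTTF_S, FAST_MTTR_S)
print(f"slow recovery: {slow:.5f}, fast recovery: {fast:.5f}")
```

With MTTF held fixed, any reduction in MTTR raises availability, which is the lever that microreboots pull.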

Because our approach is based on observation and control
at the middleware layer, it is application-generic and
requires no a priori knowledge of application structure.
This addresses the fact that today's services are
heterogeneous and dynamic, encompassing many vendors'
hardware and software components that evolve rapidly and
often turbulently, which makes maintaining the
dependability of those services a major challenge [20].

1.1 Contributions 

The main contributions of this paper are to demonstrate 
the efficacy of microreboots as a technique for improving 
the availability of distributed interactive applications and 
to circumscribe the types of failures and applications for 
which microreboots are effective. Specifically: 

• We identify a user-centric availability metric for char- 
acterizing the availability of Internet services. This 
metric reflects the observation that not all types of user
interactions contribute equally to user-perceived sys- 
tem availability. 

• We augment an off-the-shelf Java application server 
with the ability to microreboot individual components 
in unmodified J2EE applications. We show that mi- 
crorebooting one or a small number of components 
is just as effective as restarting the entire application 
or server (full reboots are currently the most common 
method in use for recovering from transients [30, 16]),
but that microrebooting is faster, leading to lower
recovery time.

• We show that, since microreboots are less disruptive, 
using them for recovery can reduce the number of end 
users who actually experience the failure and allows 
some transient failures to be masked by additional re- 
quest latency. At the same time, the total amount of 
work lost during recovery is reduced. 

• We identify specific cases in which microreboots do 
not result in satisfactory recovery, explain why this is 
the case, and propose concrete changes to the middle- 
ware (not to the applications themselves) to remedy 
this. 

Microreboots separate the concern of recovery from that 
of diagnosis and bug finding. When an online system fails, 
downtime is expensive and the first priority is to restore ser- 
vice by any means necessary. Identifying and fixing the root 
cause of the transient failure is a separate effort, and we do 
not claim that microrebooting offers any insight into doing 
these things, nor that it is more than a "temporary fix" to re- 
covering from the transient. We expect production systems 
to have thorough logging mechanisms that will allow devel- 
opers to fix root causes; microreboots improve availability 
(by lowering MTTR) without changing reliability (reflected 
in MTTF). 

Therefore, in this paper we try to isolate the effects of 
microrebooting as a recovery procedure as much as possi- 
ble. In particular, our microbenchmarks trigger recovery 
actions directly, rather than injecting faults and waiting for 
fault detection to trigger recovery. We recognize that reduc- 
ing fault detection time is critical, and we explain how mi- 
croreboots have the potential to enable the use of promising 
new approaches to fast and aggressive detection. Similarly, 
we recognize that understanding what to microreboot when 
a failure occurs is important; we have addressed that
problem elsewhere [6] and we use the results of that
technique to drive the experiments in this paper.

In Section 2 we describe the microrebooting approach
and motivate its use for three-tier Internet applications.
We then describe the changes we made to enable
microreboots in JBoss, an open-source J2EE application
server (middleware platform). Section 3 describes our
experimental setup, sample application, and metrics.
Section 4 presents experimental results, using trace
playback and induced recovery, to support our claims. We
discuss implications of the approach in Section 5, survey
related work in Section 6, and then conclude.

2 What Are Microreboots? 

In this section we explore the conditions under which 
reboot-based recovery is feasible and describe the concept 
of a microreboot. We present the platform chosen for our 
work, how it satisfies the conditions for reboot-based recov- 
ery, and conclude with a description of our implementation 
of the microreboot mechanism. 

2.1 Reboot-based recovery 

Chandra and Chen [12] and Lowell and Chen [33] formulated
an approach to application-generic recovery (i.e.,
recovery without application-specific knowledge) based on 
checkpointing, and demonstrated that relatively few exist- 
ing applications could be successfully recovered by this ap- 
proach. They studied both Unix-style monolithic applica- 
tions such as vi and large open-source Internet service com- 
ponents such as MySQL and Apache. 

However, part of the appeal of rebooting as a recovery 
technique [7] is precisely that it discards corrupted
transient state that might itself be the cause of the failure or
whose cleanup may be necessary in order for recovery to 
succeed. Therefore we expect that replacing recovery with 
rebooting — which is logically equivalent to restarting from 
a checkpoint that is the start state of the component — is 
more likely to work, assuming it is safe to try. To ensure 
that it is safe to try, we must consider three environmental 
conditions: 

1. Boundary: There must be a clear boundary around
what is being rebooted, i.e., it should be possible to in- 
dicate unambiguously what state will be lost, what re- 
sources released, what loci of control returned to their 
start state, etc. For example, in the case of a process, 
the boundary is typically the process's heap and any 
kernel data structures or resources being maintained 
on the process's behalf. 

2. Loose coupling: If the entity being rebooted is part of
a distributed system, other entities that communicate 
with it must be able to tolerate the reboot event as nor- 
mal. For example, in a distributed system, calls to an 
RPC server that has failed and is in the process of re- 
covering could be stalled or temporarily rerouted to a 
failover RPC server. 
3. Preserving state and consistency: To avoid data loss,
we must be sure that all state visible outside the com- 
ponent's boundary is either soft or discardable state or 
is committed to a separate persistent state store (which 
presumably has its own recovery procedures in case of 
failure). For example, UDP multicast tree information 
is discardable soft state whose reconstruction is explic- 
itly part of the corresponding routing protocol. 

Note that these conditions do not guarantee that reboot- 
ing will be a successful recovery method, only that it will 
not result in a change in application semantics (e.g., loss of 
data or loss of consistency). 

Analogous to rebooting, a microreboot is a logical restart 
of an application component that may be finer-grained than 
a process, but the same requirements apply. Architectures 
that have these properties, such as Microsoft's .NET and 
Sun's Java 2 Enterprise Edition (J2EE), provide an excellent 
platform for studying microreboots. For the work presented 
here we used J2EE; in the next section we briefly review the 
J2EE programming model and why J2EE applications meet 
our requirements. 

2.2 The synergy between µRB and J2EE

Most Internet applications are deployed in a "three tier" 
arrangement [26][11] that makes state management explicit.
The presentation tier consists of stateless Web servers that 
handle and demultiplex incoming connections. The applica- 
tion logic tier runs the code that constitutes the application. 
Finally, the persistence tier stores state that is expected to 
survive across requests, whether per-user or across all users. 

J2EE [42] is a model for constructing the application
logic tier of such applications and lends itself well to a 
microreboot-based failure management approach. J2EE 
applications are constructed from reusable Java modules, 
called Enterprise Java Beans (EJBs). Beans run in the 
managed environment of an application server, which pro- 
vides containers in which the beans are instantiated and run, 
provides naming/directory/authentication services for
inter-bean communication at runtime, etc. (see Figure 1).

To run a J2EE application, one must boot the operat- 
ing system, start the J2EE application server, start any nec- 
essary additional components required by the application 
(e.g., a database used for persistent state storage and the 
Web server front-ends), and finally "deploy" the applica- 
tion on the application server, i.e. instantiate each EJB in 
its container and allow the application to begin accepting 
requests from the Web servers. Once the application is run- 
ning, Java threads are mapped to incoming Web requests, 
and several EJBs may be called during servicing of a given 
request. Thus EJBs are really akin to event handlers: each 
EJB does not have a separate locus of control, rather a single 
thread "shepherds" the user request through multiple EJBs. 
Figure 1. Architectural diagram of JBoss. End-user
requests enter via the HTTP front end and are serviced by a
subset of the EJB components. The database provides per- 
sistence for application and sometimes session state. 



With this description, we can now see how the safe-reboot 
requirements from Section 2.1 map onto J2EE applications.

Item (1), clear boundaries around microrebootable com- 
ponents, maps onto J2EE in at least two different ways. 
First, each EJB is a well-circumscribed entity that can be 
microrebooted by undeploying it and redeploying it. Sec- 
ond, the Web server processes that dispatch incoming HTTP 
traffic to EJBs are also self-contained and can be restarted. 
We describe in detail how this is done in Section 2.3.

For item (2), loose coupling, observe that since inter- 
bean calls are mediated by the application server, we can 
modify the application server to intercept those calls.
In particular, if an EJB is in the process of being microre- 
booted, we can stall calls to that EJB rather than allowing 
them to experience an error as a result of the callee EJB 
being unavailable. 

Item (3), maintenance of persistent state, essentially falls 
out of the J2EE application model. J2EE applications ma- 
nipulate two kinds of state that are visible from outside an 
EJB's boundary. Persistent state, such as user profiles and
static content, is stored in a traditional RDBMS and ac- 
cessed via JDBC connectors. RDBMSs are well known for 
having robust, if not always fast, recovery procedures that 
provide strong data integrity guarantees. The second kind 
of non-transient state is session state, which is tied to the 
maintenance of a particular user's session (a set of related 
interactions with the service). Since HTTP is stateless and 
most browsers provide only cookie management facilities, 
any nontrivial session state must be managed by the service. 
J2EE provides an abstraction called a stateful session bean 
that preserves session state across invocations, but the pre- 
cise implementation of this abstraction varies among J2EE 
application server implementations. As we describe in
Section 3, a deficient implementation of this feature could
cause microreboots to change application semantics; in that
same section we propose a solution to this problem that
leverages existing research and does not require changing
individual applications.

In summary, the J2EE application model is a good fit for 
the requirements of microrebooting, with the possible ex- 
ception of the management of session state, which we will 
return to in Section 4. We now describe the particular J2EE
application server implementation that we augmented for 
the work in this paper. 

2.3 Microreboots in JBoss 

We built upon the popular JBoss J2EE server fTT], be- 
cause it is open source and because its performance and 
features compare favorably with proprietary closed-source 
offerings [3]. Its use in production environments is
increasing rapidly, having had over 3 million downloads
this year. We instrumented JBoss in a number of ways; a
description of the early changes we made appeared in [9].
In this paper we focus on how we enabled the application
server for microreboots.

Our basic microrebootable component of J2EE applica- 
tions is the EJB. In the same way an OS kernel does for 
processes, JBoss maintains for each active EJB a rich set of 
metadata. Some of the items include the name under which 
this component is known to other parts of the application, 
the Java class implementing its functionality, the type of 
EJB (session, entity, etc.), whether the bean requires trans- 
actional support, along with references to other beans that 
this EJB might call and references to the resources required 
by the EJB. 

JBoss already includes a mechanism for cleanly "shut- 
ting down" an EJB; our microreboot mechanism builds 
upon that. A Java class runnable as an EJB must imple- 
ment the standard EJB interface. When an EJB is created, 
its constructor ejbCreate() gets invoked and, when it
is destroyed, ejbRemove() allows it to clean up prior to
deletion. When the application activates an EJB to process
a request, it invokes the ejbActivate() method; when
the EJB is disassociated, its ejbPassivate() method is
called. When microrebooting an EJB, our version of JBoss 
discards all metadata associated with the bean and resets the 
corresponding entries in the server structures that keep track 
of the bean; if the EJB was involved in any ongoing transac- 
tions, those transactions are aborted. To restart the EJB, we 
simply use the existing deploy mechanism to reinstantiate 
it. 

In the simple call model, whenever an EJB wants to in- 
voke another EJB's method, it looks up the target EJB by 
name in the Java Naming and Directory Interface (JNDI) and uses the
Java class resulting from the lookup to make the invocation 
(similar to the way RPC stubs work). Since doing a lookup 
on every call is expensive, JBoss provides the caller with a 
proxy on the first lookup, which then handles all subsequent 
calls without interacting with JNDI. 
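
The lookup-once behavior described above can be sketched as follows. This is a toy model in Python rather than JBoss code: NamingDirectory, CachingProxy, and ItemBean are hypothetical names introduced for the example, and the real proxy machinery is considerably richer.

```python
class NamingDirectory:
    """Toy stand-in for a naming service: maps names to component objects."""

    def __init__(self):
        self._bindings = {}
        self.lookup_count = 0  # counts how often callers hit the directory

    def bind(self, name, component):
        self._bindings[name] = component

    def lookup(self, name):
        self.lookup_count += 1
        return self._bindings[name]


class CachingProxy:
    """Resolves the target once, then forwards calls without new lookups."""

    def __init__(self, directory, name):
        self._directory = directory
        self._name = name
        self._target = None

    def invoke(self, method_name, *args):
        if self._target is None:  # only the first call pays for a lookup
            self._target = self._directory.lookup(self._name)
        return getattr(self._target, method_name)(*args)


class ItemBean:  # hypothetical component used only for the demo
    def describe(self, item_id):
        return f"item {item_id}"


directory = NamingDirectory()
directory.bind("ItemBean", ItemBean())
proxy = CachingProxy(directory, "ItemBean")
results = [proxy.invoke("describe", i) for i in range(3)]
print(results, "lookups:", directory.lookup_count)
```

Because every call passes through the proxy, it is also a natural place for the server to intervene while the target is being microrebooted.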



A problem can arise when a recovering EJB is called 
by another EJB or a servlet; all interactions between EJBs, 
however, are controlled by the application server. When a
bean is microrebooted, other components can see the effect 
of this in one of two ways: either the bean is not currently 
registered in JNDI (which would result in a failure when the 
bean reference is looked up), or it is not available for pro- 
cessing requests. In either case, we arrange for the calling 
proxy to receive a RetryLater(t) exception, where t is the 
estimate, in milliseconds, of how much longer it will take 
for the callee to recover. 

EJBs run inside bean containers, and it is these contain- 
ers that are in charge of making the various calls outside the 
bean. Our modified JBoss container catches RetryLater(t)
exceptions, pauses for approximately t milliseconds, and
retries the call in the hope that the target bean has
recovered. If the call succeeds within a predetermined
number of tries, the original bean code sees a successful
call and has no knowledge that the target bean had actually
failed and recovered in the meantime. Otherwise, the
container throws an exception to the caller. These
transparent retries allow us to mask transient failures
from callers, as will be shown in Section 4.4. Such masking
makes transient failure an acceptable mode of operation.
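
A minimal sketch of this retry loop follows, written in Python for brevity rather than as the actual container code. The names call_with_retry and RecoveringBean, the default of three tries, and the choice of RuntimeError for the final failure are assumptions for illustration; only the RetryLater(t) exception and the pause-then-retry behavior come from the description above.

```python
import time


class RetryLater(Exception):
    """Signals that the callee is recovering; wait_ms estimates how long."""

    def __init__(self, wait_ms):
        super().__init__(f"retry in about {wait_ms} ms")
        self.wait_ms = wait_ms


def call_with_retry(call, max_tries=3, sleep=time.sleep):
    """Invoke call(); on RetryLater, pause and retry up to max_tries times."""
    for attempt in range(1, max_tries + 1):
        try:
            return call()
        except RetryLater as exc:
            if attempt == max_tries:
                # Out of tries: surface a failure to the original caller.
                raise RuntimeError("callee did not recover in time") from exc
            sleep(exc.wait_ms / 1000.0)


class RecoveringBean:
    """Toy callee that is unavailable for its first down_calls invocations."""

    def __init__(self, down_calls, wait_ms=5):
        self.remaining = down_calls
        self.wait_ms = wait_ms

    def get_item(self):
        if self.remaining > 0:
            self.remaining -= 1
            raise RetryLater(self.wait_ms)
        return "item-42"


bean = RecoveringBean(down_calls=2)
print(call_with_retry(bean.get_item, max_tries=3))
```

The caller of call_with_retry sees only the final result or a single exception, which is what makes the recovery transparent at the cost of added latency.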

A major issue when transparently retrying calls is idem- 
potency. In the present case, however, JBoss guarantees in- 
vocations to be atomic: if the call can be made to the com- 
ponent, it goes through, otherwise the RetryLater exception 
is thrown; this preserves JBoss's regular call semantics. As 
expected, calls that were in-progress at the time of the mi- 
croreboot will fail in exactly the same way they would fail 
if the EJB crashed, had a bug, etc. 



3 Experimental Setup and Metrics 



In this section we describe the testbed for our work; we 
describe briefly our sample J2EE application, the client em- 
ulator used to generate load on the application, and then 
conclude with a definition of the metric used to quantify the 
benefits of microreboots. 

In our hardware setup we tried to mimic what would be 
typical of a small Internet service. We use Linux RedHat 
9.0; JBoss and the Web tier run on an AMD Athlon XP
2600+ PC with 1.5 GB RAM, and the database (MySQL
Max 3.23) on another identical node. We use the Sun
HotSpot JVM 1.4.1 and allocate it 1 GB of RAM through
command-line arguments. Our client emulator, described
below, runs on a dual Pentium III (2 x 866 MHz) with 1
GB RAM. All machines are interconnected by a 100 Mbps 
Ethernet switch. 
3.1 eBay-like test application

RUBiS 111! is an open-source web-based auction appli- 
cation, developed at Rice University and modeled on eBay. 
It offers selling, browsing and bidding. It distinguishes 
three kinds of users: visitor, buyer, and seller, with buyer 
and seller sessions requiring login. A buyer can bid on 
items and consult a summary of their current bids, rating 
and comments left by other users. Seller sessions require a 
fee before a user is allowed to put up an item for sale. The 
seller can specify a reserve (minimum) price for an item. 
RUBiS contains 582 Java files and about 26K lines of code; 
it uses MySQL for the database back end and stores 7 tables. 
In the default configuration, RUBiS has about 33,000 items 
for sale, distributed among eBay's 40 categories and 62 re- 
gions. There is an average of 10 bids per item, or 330,000 
entries in the bids table. The users table has 1 million en- 
tries. 

We obtained a description of the failure dependencies
between RUBiS's components using automated
fault-propagation inference (AFPI) [6]. As can be seen in
Figure 2, the majority of such dependencies in RUBiS are
between the stateless servlets and EJBs; the only beans
that can propagate faults to other beans are IDManagerEJB,
ItemEJB, CategoryEJB, and UserEJB. AFPI information is
collected during a completely automated fault-injection
campaign that requires no a priori knowledge of the
applications' structure or semantics. In this paper,
however, we focus on µRB-ing as a technique, not on how we
might use µRBs in a production system or what policies we
might devise based on AFPI-generated graphs.




Figure 2. Fault propagation map for RUBiS, obtained
with AFPI. Shaded boxes represent EJBs, clear boxes
represent servlets (which are inherently stateless). An
edge from A to B indicates that a fault in A was observed
to propagate to B; microrebooting A entails microrebooting
the transitive closure of A over this fault propagation
map.

For RUBiS, a simple recovery policy could be based 
on the fact that there is a known mapping from the URL 
being accessed to the action being taken (and hence the 
EJBs being touched). For instance, if a failure is seen on
http://ejb_rubis_web/servlet/BrowseCategories, then we know
that something in the path starting at the BrowseCategories
servlet has gone wrong, and the system should therefore
automatically µRB the corresponding components.
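
Such a policy might be sketched as below: the failing URL selects a servlet, the servlet selects the EJBs it touches, and each of those is expanded to its transitive closure over the fault propagation map. Both tables are hypothetical; the component names appear in RUBiS, but the edges shown here are invented for the example and are not the AFPI results.

```python
from collections import deque
from urllib.parse import urlparse

# Hypothetical data for illustration only; not the measured AFPI map.
SERVLET_TO_EJBS = {"BrowseCategories": ["CategoryEJB"]}
FAULT_MAP = {
    "CategoryEJB": ["ItemEJB"],
    "ItemEJB": ["IDManagerEJB"],
    "UserEJB": ["IDManagerEJB"],
}


def closure(start, fault_map):
    """All components reachable from start along fault-propagation edges."""
    seen = {start}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        for nxt in fault_map.get(node, ()):
            if nxt not in seen:  # guards against cycles in the map
                seen.add(nxt)
                queue.append(nxt)
    return seen


def components_to_microreboot(url):
    """Map a failing URL to the set of EJBs that should be microrebooted."""
    servlet = urlparse(url).path.rsplit("/", 1)[-1]
    targets = set()
    for ejb in SERVLET_TO_EJBS.get(servlet, []):
        targets |= closure(ejb, FAULT_MAP)
    return targets


print(sorted(components_to_microreboot(
    "http://ejb_rubis_web/servlet/BrowseCategories")))
```

A breadth-first traversal with a visited set is used so that mutually dependent components are handled without looping.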

3.2 Client emulator 

Our client emulator is a modified version of the load gen- 
erator that ships with RUBiS. 

We describe the workload of the simulated clients using
a state transition table T that has the client's possible
states as rows and columns; these states correspond
naturally to the various operations possible in RUBiS, such
as Register, SearchItemsInCategory, PutBid, etc. (27 in
total). In addition to the application-specific states, we
also have two states corresponding to the user hitting the
back button (Back) and spontaneously deciding to end
his/her session (End). T(Sa, Sb) represents the probability
of a client transitioning from Sa to Sb; e.g.,
T(ViewItem, BuyNow) describes the probability we associate
with the user clicking on the "Buy Now" button while
viewing an item's description.

The client emulator uses this table to automatically nav- 
igate the web site; when in a given state s, it will randomly 
choose the next state based on T(s) with the requested prob- 
ability; it then constructs the URL for this state and "clicks" 
on it. The table T also has a column for how long a user
waits between clicking from a certain state to the next;
in our experiments, however, we set this wait time to zero.
Unlike a real user, our emulator will therefore initiate the
next HTTP request as soon as the current request completes
(whether successfully or not). The emulator uses one thread
per simulated client. 
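
The table-driven navigation can be sketched as a random walk. The excerpt of T below is hypothetical (the real table has 27 application states and different probabilities), the state names Home and Browse are placeholders, and the handling of Back is a simplification that returns to the state visited immediately before the current one.

```python
import random

# Hypothetical excerpt of the transition table T; each row sums to 1.
T = {
    "Home": {"Browse": 0.7, "Register": 0.2, "End": 0.1},
    "Browse": {"ViewItem": 0.6, "Back": 0.2, "Home": 0.1, "End": 0.1},
    "ViewItem": {"BuyNow": 0.3, "PutBid": 0.3, "Back": 0.3, "End": 0.1},
    "Register": {"Home": 1.0},
    "BuyNow": {"Home": 1.0},
    "PutBid": {"Home": 1.0},
}


def emulate_client(rng, max_clicks=50):
    """Walk T from Home with zero think time; return the visited states."""
    visited = ["Home"]
    for _ in range(max_clicks):
        row = T[visited[-1]]
        choice = rng.choices(list(row), weights=list(row.values()), k=1)[0]
        if choice == "End":
            break
        if choice == "Back":
            # Simplified Back: return to the state visited just before.
            choice = visited[-2] if len(visited) > 1 else "Home"
        visited.append(choice)
    return visited


print(emulate_client(random.Random(0)))
```

In the real emulator each visited state would be turned into a URL and requested over HTTP; here the walk only records state names.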

We classify responses from the server as correct or
incorrect. Correct responses are used to compute the
server's goodput (throughput of correct responses per
second); we will describe our other metrics in more detail
later. A response will be classified as incorrect if it
results in a network-level error (cannot connect to server,
etc.), an HTTP 4xx or 5xx error code, or an HTML page
containing particular keywords that we know to be
indicative of application errors. Clearly this is not fully
application-generic; we have successfully detected all
error pages in RUBiS and other J2EE applications with no
false positives by searching for "error", "failed", and
"exception" in the reply HTML, but this assumes none of the
users is selling an item that would match these searches.
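
The classification rule can be written down in a few lines. The function name and signature below are assumptions for this sketch, as is the choice to match keywords case-insensitively; the three keywords and the error conditions are the ones listed above.

```python
ERROR_KEYWORDS = ("error", "failed", "exception")


def is_correct(status, body, network_error=False):
    """Return True if a response counts toward goodput.

    status is the HTTP status code (None if no response was received),
    body is the reply HTML, and network_error flags connection failures.
    """
    if network_error or status is None:
        return False
    if 400 <= status <= 599:  # HTTP 4xx and 5xx are errors
        return False
    text = body.lower()  # assumption: keyword match ignores case
    return not any(keyword in text for keyword in ERROR_KEYWORDS)


print(is_correct(200, "<html>Your bid was placed</html>"))
print(is_correct(200, "<html>Exception in ItemEJB</html>"))
```

As noted above, a legitimate page that happens to contain one of the keywords (for example, in an item description) would be misclassified by this rule.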

3.3 Metrics 

A typical interaction of a client with the web site
proceeds as follows: the client goes to the homepage,
browses around for a while performing different site
actions (searching, etc.), and then decides to do something
that touches the persistent-state database (e.g., place a
bid, leave a comment for a user, update his/her profile,
etc.). The DB-touching operation(s) usually require the
user to have logged in. We assume that all interactions
preceding the persistent-state update are just precursors
to the real action (the DB update is sort of a "crowning
moment"); thus, these interactions serve no purpose (with
respect to the proposed metric) in the absence of a
successful "crowning moment."

[Figure 3 panels: "Raw Response Profile (1 client)", showing correctly satisfied and unsatisfied requests, and "Session Profile (1 client)", showing successfully completed and aborted sessions; horizontal axis: Time [seconds].]

Figure 3. Goodput anomaly: We induced a failure (unrecovered) in QueryEJB at t = 30, thus causing certain queries against the database to fail. On the left we show a request profile (both good and failed requests); on the right we show a profile of the sessions (both the ones that completed and those that failed). Witness the goodput anomaly: in the face of failure, the raw goodput increases simultaneously with the rate of unsatisfied requests. The session profile, however, shows that the number of aborted sessions goes from zero to an average of 14 aborted sessions/second. Users start a new session after the previous one has failed, so the increased number of aborted sessions brings about an increase in session initiations; in real services, this could constitute an unwelcome load spike for certain types of services, such as user login and authentication.

For the purpose of this paper, we define a session to be 
a sequence of interactions with the web site that starts at 
the homepage and ends with the emulated user either aban- 
doning the site or returning to the homepage (thus starting 
a new session). Note that if something goes wrong during 
a session, users often try to logout and log back in (which 
typically requires going to the homepage), so it is reason- 
able to define a session as the sequence of URLs bracketed 
by accesses to the homepage. This definition relies on user 
behavior to infer when s/he is done with the site, rather than 
trying to understand application semantics. 

A simple approach to measuring the effect of downtime
on end users would be to measure goodput (number of
requests completed successfully) under partial-failure
conditions, averaged across all clients. This is usually
how performability [35] is measured: the amount of work
successfully completed over a period of time in the
presence of partial failures.

But the simple goodput metric fails to distinguish
between potentially long-running DB-touching operations and
simple, fast browse-only operations. The surprising result
of inducing failures during long-running DB operations is
that the goodput actually goes up in the presence of
failure, because the user no longer waits for a
long-running operation: it fails right away and the client
emulator moves on to the next (non-DB-touching) operations.
In other words, as Figure 3 shows, the simple goodput
metric would fail to capture that some operations are more
"valuable" than others, and executing many "simple"
operations does not necessarily compensate for failing to
execute a few long-running ones.



Instead we propose two metrics. The first, Gses, counts
the number of sessions in which all operations completed
successfully (i.e., every operation within the session was
failure-free). In Figure 3 we illustrate this metric for 1
client. Note that this is a pessimistic metric: the user
may believe s/he has accomplished useful work during the
session, but unless every operation succeeds, the session
is not counted as successful. The second metric,
session-weighted goodput (Gwop), weights each session by
the number of operations in it. Viewed differently, Gwop
measures standard throughput of successful and failed
requests respectively, but whenever a session fails, all
the operations in that session are counted as failed.
Unlike Gses, Gwop is able to capture the fact that when a
long session succeeds, the user got a lot more done than
when a short session succeeds. Figure 4 illustrates this
metric. We will primarily use Gwop to quantify the effect
of microreboots in our experiments.
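
To make the two definitions concrete, the sketch below computes both metrics over a small set of sessions. The representation (each session as a list of booleans, True meaning the operation succeeded) and the function names are assumptions for the example, and the sample data is invented.

```python
def g_ses(sessions):
    """Number of sessions in which every operation succeeded."""
    return sum(1 for ops in sessions if all(ops))


def g_wop(sessions):
    """Session-weighted counts of (satisfied, unsatisfied) operations.

    Every operation in a failed session is counted as unsatisfied,
    even if that individual operation succeeded.
    """
    satisfied = sum(len(ops) for ops in sessions if all(ops))
    unsatisfied = sum(len(ops) for ops in sessions if not all(ops))
    return satisfied, unsatisfied


# Invented sample: a long good session, a failed session, a short good one.
sessions = [
    [True, True, True, True, True],
    [True, True, False],
    [True],
]
raw_good = sum(op for ops in sessions for op in ops)

print("raw goodput:", raw_good)   # counts individually successful operations
print("Gses:", g_ses(sessions))
print("Gwop:", g_wop(sessions))
```

In this sample the raw count credits the two successful operations of the failed session, whereas Gwop moves them to the unsatisfied side.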

[Figure 4 panels: "Session-Weighted Response Profile (no faults)" and "Session-Weighted Response Profile (injected fault)".]

Figure 4. Session-weighted goodput: On the left we see a run with no failures; on the right, we inject a permanent/unrecovered fault in QueryEJB at t = 100 sec. Notice how this causes unsatisfied requests to show up even "before" t = 100 sec, because operations done as part of sessions that started at t < 100 and failed at t > 100 are marked as failed by the Gwop metric. Even though these individual requests were satisfied, the usefulness of satisfying them is lost because of the session failure. We also notice that the throughput of satisfied requests starts dropping prior to t = 100, for similar reasons: some sessions starting in that interval end up failing.



[Figure 5 panels: "Session-Weighted Goodput (1000 clients)" and "Session-Weighted Goodput (20 clients)"; horizontal axis: Time [seconds].]

Figure 5. Sweet spot: On the left we show the session-weighted goodput for a 1000-client load; on the right we show the same for a 20-client load. We empirically determined that the experimental results are most consistent for 20 concurrent fast clients. For higher numbers than that, our single-node application server starts thrashing and throughput becomes erratic and slightly lower on average, as seen in the left graph. Moreover, inspecting traces of 20 clients for correctness is considerably easier than inspecting those of 100 or 1000 clients.



4 Results 

In order to experimentally isolate the µRB recovery
mechanism from fault detection, we initiated various forms 
of rebooting in the application without actually injecting 
faults. The aspect of detecting such faults was the focus 
of O- We are, therefore, assuming in all our experiments 
that the application server has instantaneously detected the 
fault and initiated reboot-based recovery. Note that the
problem of failure detection is orthogonal to the recovery
method used, though in Section 5 we argue that µRBs make
it potentially much easier to apply certain kinds of fast
failure detection algorithms.

Reboot-based recovery is typical for many failures noticed 
in deployed Internet systems, where resource leaks, 
deadlocks, etc. occur on a regular basis [30]. In fact, the 
original version of RUBiS/JBoss had a bug that caused it 
to deadlock when the number of concurrent users exceeded 
10, and we have shown in [9] that a modified JBoss could 
automatically recover from this deadlock. The version of 
RUBiS used in our experiments incorporates a number of 
fixes to enable it to run with many concurrent users. 

[Figure 6 panel titles: JBoss Application Server Reboot (20 clients); RUBiS Application Restart (20 clients)] 

Figure 6. Microreboots vs. other forms of reboot: Graph (a) depicts the impact on end users of a full JBoss server process restart: 
713 failed requests over a time span of 108 seconds. Graph (b) shows a full RUBiS application restart: 615 failed requests over 94 
seconds. Graph (c) shows the impact of microrebooting one EJB with no call retries: 251 failed requests over 24 seconds. Graph 
(d) shows the impact of simultaneously microrebooting 3 mutually-dependent EJBs with no call retries: 345 failed requests over 29 
seconds. 

For most of the results reported here we used 20 concurrent 
clients with no think time between successive requests. 
Given that a human user typically spends in excess 
of 2 seconds between clicks, we believe the load placed by 
one of our simulated clients is equivalent to that of 100 or 
more real clients. We did not want to introduce artificial 
think time (the way it is done, for instance, in the TPC-W 
benchmark) because having think time would add one more 
variable to the experiment and would not offer any useful 
insight for μRB experiments. The reason we settled for 20 
concurrent rapid clients is that it appeared to be the 
threshold beyond which thrashing and other side effects would 
reduce throughput; see Figure 5 for a comparison of 
throughput for 20 vs. 1000 clients. 

An important characteristic of our chosen workload is 
that it covers all possible RUBiS operations; experimentally 
we have determined that, in runs lasting 1 minute or 
longer with 20 clients, we routinely exercised all 
components. While this might be surprising for a set of 20 
human users, our no-wait-time clients navigate through the 
site very rapidly. The workload we used for the experiments 
reported here had an approximate mix of 85% read 
operations and 15% DB write operations. 

In the remainder of this section we will show that μRB-ing 
is faster and less disruptive than other forms of reboot, 
discuss correctness in the presence of μRB-ing, and conclude 
with a technique that, in conjunction with microreboots, 
can mask transient failures from end users. 

4.1 Microreboots are fast 

Our first goal was to determine whether the microreboot 
mechanism we built into JBoss can indeed reduce recovery 
time. We performed four sets of experiments, comparing 
full reboot of the JBoss application server, full reboot of the 
RUBiS application, microreboot of one EJB (QueryEJB), 
and microreboot of multiple dependent EJBs (UserEJB, 
ItemEJB, and BidEJB), respectively. Such reboots would be a 
normal response to a variety of reboot-curable failures, such 
as a bean running out of memory (quite frequent in Java 
systems) or being hung in a deadlocked transaction. We can 
think of these experiments as attempting to recover from a 
failure in one of the application components; Figure 6 shows 
the results. 

In the four graphs we show profiles both of successfully 
served requests and failed requests. We are particularly 
interested in reducing the total number of failed requests (area 
under the bold curves in the graphs), as this reflects 
exposure of the failure to end users. We also want to reduce the 
amount of time during which the service appears down to 
any user. In the case of single-component microreboot, we 
reduced the number of failed requests by a factor of 2.84 
over server process reboot and by a factor of 2.45 over an 
application restart. The duration of time during which the 
site was perceived down by some users was reduced by a 
factor of 4.5 over server process restart and by a factor of 
3.92 over application restart. 

In graph (d) we show the effect of microrebooting a 
group of EJBs. This can be required either because multiple 
components have failed, or because there are dependencies 
between components. In this case, we may be reacting to a 
failure in UserEJB; as can be seen in Figure 2, the transitive 
closure of UserEJB over the fault propagation graph is the 
group of beans UserEJB, ItemEJB, and BidEJB. Based on 
this information, we microreboot the three beans together. 
By microrebooting the EJBs instead of restarting the 
application, we reduce the number of failed requests by a factor 
of 1.78 and the downtime by a factor of 3.24. Table 1 
summarizes the results of these experiments. 
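
Choosing which beans to microreboot together amounts to a reachability computation over the fault propagation graph. The following is a minimal sketch of that step; the edges are hypothetical, chosen only so that the closure of UserEJB matches the group named above, and the function name is an assumption. The actual graph is the one in Figure 2.

```python
from collections import deque

# Hypothetical fault-propagation edges (an assumption, not copied from
# Figure 2): an edge A -> B means a fault in A may propagate to B.
FAULT_PROPAGATION = {
    "UserEJB": ["ItemEJB"],
    "ItemEJB": ["BidEJB"],
    "BidEJB": ["UserEJB"],
    "QueryEJB": [],
}


def reboot_group(failed, graph):
    """Return every component reachable from `failed`, including itself."""
    seen = {failed}
    queue = deque([failed])
    while queue:
        node = queue.popleft()
        for neighbor in graph.get(node, []):
            if neighbor not in seen:
                seen.add(neighbor)
                queue.append(neighbor)
    return seen


print(sorted(reboot_group("UserEJB", FAULT_PROPAGATION)))
print(sorted(reboot_group("QueryEJB", FAULT_PROPAGATION)))
```

With these assumed edges, a failure in UserEJB yields a three-bean group, while a failure in QueryEJB yields a group containing only QueryEJB, matching the single-EJB and three-EJB experiments.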



Recovery         Failed   Downtime   Improvement over JBoss restart 
Technique        Reqs     [sec]      Requests       Downtime 
JBoss restart    713      108        -              - 
RUBiS restart    615      94         14%            13% 
1-EJB μRB        251      24         65%            78% 
3-EJB μRB        345      29         52%            73% 

Table 1. Improvement relative to process restart: Comparison 
of application restart, one-EJB microreboot, and 
three-EJB microreboot in terms of failed requests and 
perceived downtime. 
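
The percentages in Table 1 and the reduction factors quoted in the text both follow directly from the measured counts. The short calculation below reproduces them; the helper names are ours and are used only for illustration.

```python
# Measured values from Table 1: (failed requests, downtime in seconds).
RESULTS = {
    "JBoss restart": (713, 108),
    "RUBiS restart": (615, 94),
    "1-EJB μRB": (251, 24),
    "3-EJB μRB": (345, 29),
}


def improvement(baseline, value):
    """Percentage reduction of `value` relative to `baseline`, rounded."""
    return round(100 * (baseline - value) / baseline)


def factor(baseline, value):
    """Reduction factor of `value` relative to `baseline`, two decimals."""
    return round(baseline / value, 2)


base_requests, base_downtime = RESULTS["JBoss restart"]
for name, (requests, downtime) in RESULTS.items():
    print(name,
          f"{improvement(base_requests, requests)}%",
          f"{improvement(base_downtime, downtime)}%")

# Factors quoted in the text for the single-EJB microreboot.
print(factor(713, 251), factor(615, 251), factor(108, 24), factor(94, 24))
```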



4.2 Microreboots are less disruptive 

If a recovery method is effective, then users of the recovered 
system should be able to resume their work immediately 
following the recovery of the system under load. As 
evidenced by the lack of failed requests after recovery 
completes (Figure 6), the system sustains all four methods of 
recovery equally well, and users can continue their work after 
microreboots just as in the case of regular reboots. In 
fact, the implementation of whole-application reboot is in 
effect a collection of microreboots, because "rebooting" the 
application consists of undeploying and then redeploying 
the individual application components. 

As can be seen in Figure 6, microreboots have the potential 
to be considerably less disruptive than either server 
or application restarts. For example, in graphs (a) and (b) 
goodput drops down all the way to zero, while the rate 
of failed requests does not increase dramatically. In both 
cases, the application loses all its network connections to 
the clients; when clients try to establish new connections, 
they are refused flat out, since no process is listening on the 
corresponding port. The result is that all users of the service 
experience downtime, hence the zero goodput. 

When using a μRB to recover a component, however, 
we are effectively partitioning the user population into two 
groups: those whose requests require the services of that 
component and those whose requests do not. As seen in 
graphs (c) and (d), goodput does not drop to zero even in- 
stantaneously, because non-affected (non-recovering) com- 
ponents can continue to deliver service while faulty ones are 
microrebooted. Users whose sessions do not require the re- 
covering component can continue working as if nothing had 
failed. Thus, in addition to recovering faster, microreboots 
also offer the opportunity to reduce the impact of failures 
on users who are active at the time of failure. 

4.3 Microreboots and correctness 

"Correct behavior" in the face of microreboots is difficult 
to define, because by assumption microreboots are useful 
only because there are transient bugs in the application 
or server. With or without μRBs, such bugs clearly might 
cause incorrect behavior. However, we can argue that the 
effect of a μRB is no different than the effect of a full 
reboot. 

First, observe that the result of a particular user request 
depends only on the EJBs it calls and on any persistent 
state that would affect the way the bean handles the request 
(whether or not that state is directly visible outside the bean 
boundary). Instances of stateless beans are by definition 
indistinguishable from each other, and in fact application 
servers simply select an available instance from a pool of 
such beans when the bean is needed for processing an in- 
coming request. Bean-independent state stored in a trans- 
actional database (e.g., a list of all bids placed by a given 
user) is not affected any differently by full reboots vs. mi- 
croreboots of the application server: as long as the database 
provides transactional semantics, either event is "serialized" 
between two transactions. The remaining possibility is that 
the bean is a stateful session bean, which expects to pre- 
serve its state across invocations. In this case we need to 
know where the state is kept, and whether it would survive 
a μRB of the bean. 

As we stated earlier, management of session state varies 
across implementations of J2EE application servers. JBoss 
offers two options: (a) individual EJBs can manage their 
own session state by explicitly updating the transactional 
database; (b) JBoss can transparently manage session state, 
which it does by keeping it in RAM with no replication or 
backup. RUBiS happens to use (a), which means all beans' 
session state would survive microreboots of the beans 
themselves. However, an application that used (b) would not 
have its session state preserved across a μRB. In fact, we 
encountered this problem in PetStore, a J2EE application that 
models a simple e-commerce site; after a μRB, all subsequent 
requests from the affected (simulated) user systematically 
failed, until the (simulated) user abandoned the session 
and logged back in (thereby creating a new valid 
session-state object), after which subsequent requests succeeded. In 
other words, μRBs are not transparent to applications 
relying on JBoss's implementation of session state. (Note, 
however, that such applications will not survive full reboots 
either: in that case, all currently connected users would lose 
their sessions, not just the user whose session beans were 
μRB'd.) 
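
The difference between options (a) and (b) can be shown with a toy model. This is a sketch under stated assumptions, not JBoss code: the class and method names are hypothetical, and a plain dictionary stands in for the transactional database.

```python
class BeanWithLocalState:
    """Option (b): session state lives only in the component's memory."""

    def __init__(self):
        self._sessions = {}

    def login(self, user):
        self._sessions[user] = {"items": []}

    def add_item(self, user, item):
        # Raises KeyError if the session object no longer exists.
        self._sessions[user]["items"].append(item)
        return list(self._sessions[user]["items"])

    def microreboot(self):
        # Restarting with a clean slate discards in-memory state.
        self._sessions = {}


class BeanWithExternalState(BeanWithLocalState):
    """Option (a): session state is written to a store outside the bean."""

    def __init__(self, store):
        self._sessions = store  # stand-in for the transactional database

    def microreboot(self):
        # Nothing to discard: the bean holds no session state of its own.
        pass


database = {}
for bean in (BeanWithLocalState(), BeanWithExternalState(database)):
    bean.login("alice")
    bean.add_item("alice", "book")
    bean.microreboot()
    try:
        bean.add_item("alice", "pen")
        outcome = "request succeeded"
    except KeyError:
        outcome = "request failed until the user logs back in"
    print(type(bean).__name__, "->", outcome)
```

In this model the bean with local state behaves like the PetStore case: requests fail after the microreboot until a new login recreates the session object, whereas the bean backed by the external store continues unaffected.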

We propose that the correct solution to this problem is to 
manage session state externally to the EJBs using a 
mechanism that is much lighter-weight than a database (for 
performance and scalability) but provides strong guarantees of 
bounded persistence. In fact, [31] reports on a lightweight, 
RAM-only, replication-based session state storage mecha- 
nism whose contents survive microreboots and whose use 
does not compromise overall application throughput. Note 
that integrating this system into our prototype would not re- 
quire changes to the applications themselves, only to the 
application server's implementation of server-managed ses- 
sion state. We plan to explore the integration of this subsys- 
tem into our prototype in future work. 

4.4 Low-level retries mask transient failure 

When a caller tries to reach a recovering component, that 
call will typically fail. However, we can build in mecha- 
nisms for retrying such calls after the target component has 
recovered; such retries can be done in a manner completely 
transparent to the J2EE application, which means J2EE pro- 
grammers do not need to incorporate retry logic in their ap- 
plications. Microreboots coupled with transparent low-level 
retry would allow us to transform many transient failures 
into additional latency instead of externally visible failure. 
The effect of a failure on users would then simply be a 
performance hiccup, rather than the error they would have seen 
without retry mechanisms. As discussed in Section 2.3, on 
calls to recovering EJBs we provide the same semantics as 
vanilla JBoss, with the addition of a RetryLater(t) 
exception. 

Such retries can be automated and performed transparently 
at multiple levels. The highest level is provided by 
HTTP 1.1, which has a Retry-After response header, 
allowing the web server (or our application server) to 
instruct the client's browser to retry after a certain number of 
seconds. At the lowest level, calls between EJBs can be 
retried if a particular EJB is microrebooting. 
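
At the HTTP level, such an instruction is simply a response carrying a Retry-After header. The sketch below builds one; pairing the header with a 503 (Service Unavailable) status is our assumption, since the text does not say which status code the server would return, and the function name is hypothetical.

```python
from http import HTTPStatus


def recovering_response(retry_after_seconds):
    """Build a raw HTTP/1.1 response asking the client to retry later.

    The 503 status is an assumption made for this sketch; the header
    value is the number of seconds the client should wait.
    """
    status = HTTPStatus.SERVICE_UNAVAILABLE
    lines = [
        f"HTTP/1.1 {status.value} {status.phrase}",
        f"Retry-After: {int(retry_after_seconds)}",
        "Content-Length: 0",
        "",
        "",
    ]
    return "\r\n".join(lines)


print(recovering_response(2))
```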

Previous studies [36] have found that, when a user is 
waiting for an interactive service to respond, a delay of 8-10 
seconds is the threshold after which the user comes to be- 
lieve that the request has failed and clicks the Reload or Stop 
button (or worse, clicks over to another site). This suggests 
that if a site can recover from a transient failure and retry the 
failed in-flight request(s) within 8 seconds, affected users 
will have the illusion of continuous uptime — they will see 
a short delay rather than a failure. Microrebooting an EJB 
takes less than 1 second, which permits us to use call retries 
to mask EJB failures from most end users. 



[Figure 7 panel: Session-Weighted Response Profile (microreboot w/ call retry); x-axis: Time [seconds]] 

Figure 7. Microreboots with call retry: QueryEJB is 
μRB-ed at t = 100; in-flight requests are retried with a 
timeout of 100 msec. Goodput dips around t = 100 but 
none of the requests fail. In the case of multi-EJB 
microreboot, the dip would be deeper and wider, but still no 
requests would be perceived as failed by end users (unless 
recovery time exceeds 8 seconds). 

We built into JBoss the ability to retry inter-EJB calls if 
the callee is microrebooting; as described in Section 2.3, a 
call to a recovering EJB results in a RetryLater(t) 
exception. Using this facility with a hardcoded value of t = 100 
msec, we ran an experiment in which we microrebooted 
QueryEJB and observed the effect on end users. We do not 
compare these results to application-level reboot, because 
restart time exceeds the above-mentioned threshold 
(restarting all of RUBiS takes on the order of 10-11 seconds with 
no load, and 20 or more seconds under normal load). 
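
The retry behavior can be sketched as a small wrapper around a call. This is an illustrative model rather than the JBoss implementation: only the RetryLater(t) exception and the 100 msec delay come from the text, while the wrapper, its 8-second waiting budget (taken from the user-patience threshold above), and the simulated component are assumptions.

```python
import time


class RetryLater(Exception):
    """Signals that the callee is recovering; retry after `delay` seconds."""

    def __init__(self, delay):
        super().__init__(f"retry after {delay} s")
        self.delay = delay


def call_with_retry(target, *args, budget=8.0, sleep=time.sleep):
    """Invoke target(*args), retrying on RetryLater until `budget` seconds
    of waiting have been used up; then let the failure surface."""
    waited = 0.0
    while True:
        try:
            return target(*args)
        except RetryLater as exc:
            if waited + exc.delay > budget:
                raise
            sleep(exc.delay)
            waited += exc.delay


class RecoveringComponent:
    """Hypothetical callee that is unavailable for its first few calls."""

    def __init__(self, failures_before_recovery):
        self.remaining = failures_before_recovery

    def query(self, text):
        if self.remaining > 0:
            self.remaining -= 1
            raise RetryLater(0.1)  # t = 100 msec, as in the experiment
        return f"results for {text}"


print(call_with_retry(RecoveringComponent(3).query, "lamps"))
```

The caller sees only added latency (three waits of 100 msec here) instead of an error; if recovery outlasts the budget, the exception propagates and the failure becomes visible, as it would for an application-level restart.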

In Figure 7 we show the session-weighted response profile 
for microrebooting QueryEJB with call-level retry. Notice 
that the goodput dips slightly around the time when the 
microreboot is made, because the system is now busy retrying 
in-flight requests and can accept fewer new requests than 
under normal circumstances. However, the goodput only 
drops slightly and, most importantly, no request fails and 
hence all sessions can continue unaffected. Since incoming 
requests sit in the TCP connection queues, it is conceivable 
that if load were much higher and microreboot time longer, 
the TCP queues would fill up and connections would be 
refused (and the site perceived as down by end users). 

5 Discussion 

5.1 Cheap recovery allows occasional mistakes 

Microreboots enable an approach to self-management in 
which recovery is so fast and inexpensive that false posi- 
tives during failure detection become less important. The 
fact that microreboots are fast and safe potentially allows 
us to apply much more sophisticated failure detection poli- 
cies, which is important because total recovery time is often 
dominated by fault-detection time [13]. 

One promising direction involves using statistical 
anomaly detection to infer failures. Such techniques have 
been shown to reduce time-to-detection at the cost of some 
false positives [14], and a simplified version of this 
approach appears to have been successfully demonstrated in 
a state-management layer designed specifically to make 
reboots fast and safe [31]. 

In general we believe this is an important design trend for 
robust systems: fast, cheap recovery mechanisms will blur 
the line between "normal operation" and recovery. When 
recovery becomes an order of magnitude cheaper, it allows 
one to think differently about how and when to apply it. 
Since microreboots result in only a minor cost in goodput if 
applied by mistake, they provide the level of recovery per- 
formance needed to pursue a rationally-aggressive approach 
of initiating recovery at the slightest hint of failure. 

5.2 μRB-ing in Internet services 

We believe the μRB technique is best suited for large 
scale Internet services. The workloads faced by such 
services consist of short-lived, mostly-independent requests 
coming from a large population of distinct users. The work 
that Internet services must do is generally partitioned into 
disjoint sets of discrete operations, and RUBiS reflects this. 
The consequence is that, even if a few requests fail, it is 
possible for most users to be unaffected. μRBs take advantage 
of the application's structure to realize this potential. 

Additionally, the underlying protocol (HTTP) and most 
of the application logic are stateless and, except for marked, 
non-idempotent requests, end-users can safely retry failed 
requests until they succeed. This lets us reboot components 
in the system, knowing that any users affected will face 
only a minor inconvenience. In fact, this property makes it 
useful to recover even from purely deterministic bugs such 
as a pathologically malformed request: if recovery is fast 
enough, other users issuing non-pathological requests may 
still be able to use the service. 

Many Internet services today use huge in-memory 
caches in order to avoid the central database bottleneck 
(e.g., the servers at a large Internet portal use 64 GB of 
RAM just for caching database queries [39]). Unfortunately, 
a machine reboot flushes this cache, and re-warming 
it can take a long time (transferring 64 GB from a 40 
MB/sec wide-SCSI disk takes close to half an hour even at 
the full transfer rate), which is why whole-system reboots 
are generally avoided. 



5.3 When reboot-based recovery fails 



Rebooting is a correctness-preserving form of restart 
only to the extent that no "critical" state is lost and no 
inconsistency created. We identified three requirements for 
allowing safe reboots: a well-defined reboot boundary, loose 
coupling between the rebooting component and its peers, 
and preservation of state (or preservation of consistency of 
state) visible outside the reboot boundary. We chose to 
target three-tiered Internet applications based on thick 
middleware because the programming model largely enforces 
these properties already. Monolithic applications, in 
contrast, typically lack these properties, and we would not 
expect microreboots to work without extensive changes to the 
applications themselves. 

Also, since microrebooting still introduces nonzero 
recovery latency, it may be inappropriate for systems with 
tight real-time constraints; hot standby with fast failover 
may be the only acceptable option in those cases, since even 
failover to a cold spare may take too long. Of course, 
substantially all Internet services exploit some form of standby 
at multiple levels to mask transient failures [34], but standby 
capacity is expensive so standbys are rarely kept idle during 
normal operation. As a result, when failover does occur, 
the standby is serving more than its steady-state share 
of workload; presumably, the faster the primary is returned 
to its online condition, the higher the total throughput that 
can be realized. In fact, the CNN.com meltdown on 9/11/01 [29] 
demonstrated that slow node-level recovery time can lead to 
service collapse even when fast failover is in place. Ideally, 
if more transient failures could be masked as slight delays 
to the user, fast failover would not have to be used as often. 
Microreboots are thus orthogonal to failover as a technique, 
but they may result in failover being required much less 
often. 

5.4 Application-generic recovery 

In applying microreboots to the application server rather 
than by modifying individual applications, we have at- 
tempted to address the question: What level of effective 
fast recovery can we do exploiting just the middleware, 
and without application-specific efforts? We found that mi- 
croreboots can recover just as effectively from failures that 
today are usually cured by whole-system restart or whole- 
application restart, but they do so more rapidly. Moreover, 
the shorter downtime during recovery can sometimes be 
masked by a delay well within an established "distraction 
threshold" for most users, allowing the failure to be hidden 
from them; this is not practical for the longer recovery times 
required for whole-application restart. 

In our experience we have found few failures from 
which whole-system restart recovers but microreboots do 
not. Nevertheless, such failures exist; for example, we ob- 
served that under high load, the JVM running the applica- 
tion server would sometimes run out of file descriptors, or 
encounter an internal error, requiring a process restart of the 
JVM. We have also encountered a resource leak involving 
serialized objects sent over a socket: the object does not get 
garbage collected even when our references to it are gone, 
and eventually the leaks require a JVM restart. Finally, on 
our version of Linux, we also encountered on occasion a 
kernel bug in the swapping code which would trigger un- 
der high memory utilization conditions; in such cases, any 
memory allocation (specifically, any call to the brk system 
call) hangs, and a full system restart is necessary. 

Although none of these problems would have been cured 
by microreboots, we still expect microreboots to recover 
from a significant subset of those faults that a full reboot 
would cure. The strategy we proposed in [7] suggests that 
restarts be attempted at the finest granularity, and then pro- 
gressively encompass more components if the failure is not 
cured. In this sense, microreboots as presented here are 
an optimization over full reboots, albeit one that results in 
qualitatively less disruptive recovery than full reboots. 
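
The escalation strategy can be sketched as a loop over restart levels ordered from finest to coarsest. This is a minimal illustration, not the mechanism from [7]: the function names, the health check, and the simulated fault are assumptions.

```python
def recursive_recovery(levels, is_healthy):
    """Try restarts from finest to coarsest; return the level that cured.

    `levels` is an ordered list of (name, restart_function) pairs.
    Returns None if even the coarsest restart did not help.
    """
    for name, restart in levels:
        restart()
        if is_healthy():
            return name
    return None


# Simulated fault that only a JVM process restart cures (for example,
# the file-descriptor exhaustion mentioned above).
state = {"healthy": False}


def microreboot_ejb():
    pass  # does not release the leaked resource in this simulation


def restart_application():
    pass  # likewise ineffective here


def restart_jvm():
    state["healthy"] = True


levels = [
    ("EJB microreboot", microreboot_ejb),
    ("application restart", restart_application),
    ("JVM restart", restart_jvm),
]
print(recursive_recovery(levels, lambda: state["healthy"]))
```

Because the cheapest restart is always attempted first, a failure that a microreboot can cure never pays the cost of the coarser levels, while failures such as the ones listed above still end in a process restart.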

6 Related Work 

Chandra and Chen [12] classified software faults into 
three categories. Environment-independent (EI) or "Bohr 
bugs" are deterministic and do not depend on the operating 
environment; environment-dependent-transient (EDT) or 
"Heisenbugs" are due to timing or other transient conditions 
and may disappear if the operation is retried; 
environment-dependent-nontransient (EDN) bugs are related to the 
operating environment in such a way that immediate retry is not 
likely to work, because the environmental condition(s) 
responsible for the bug will not have changed enough (for 
example, a failure due to a memory leak will persist until more 
heap space is made available). There is disagreement 
regarding whether most bugs remaining in production-quality 
software are EDT [16] or EI/EDN [12], but reboot-based 
recovery techniques address all three categories. 

Other projects attempting checkpoint-based recovery 
have fared similarly to Lowell and Chen [33]. 
ARMORs [28] provide a micro-checkpointing facility for 
application recovery, but applications must be (re)written to 
use it; limited protection is provided for legacy 
applications without their own checkpointing code. ARMOR's 
own fault detection and recovery middleware does use the 
microcheckpointing facility, but in the cases where 
middleware recovery failed [44], it was because of a corrupted 
checkpoint caused by an injected fault. Libft [24] provides 
a C library for checkpointing the internal state of an 
application process periodically on a backup node, but like 
ARMOR, it requires applications to be written specifically 
to use this feature. 

Process pairs [4] were an early mechanism that combined 
resource redundancy and state mirroring to allow 
failover to a hot standby, but because they were difficult 
for programmers to use, they have had limited impact 
outside of specialized high-end systems. Transactions [19] 
have enjoyed much wider impact, and remain a key ele- 
ment of today's Internet applications, because they are easy 
for programmers to use and export a clean abstraction for 
dealing with recovery; however, when combined with re- 
lational semantics, providing transactional guarantees re- 
quires substantial engineering in order to get both good 
steady-state performance and complete crash-safety and re- 
covery. Indeed, high-volume, high-performance database 
systems cost hundreds of thousands of dollars to deploy and 
maintain. Separating applications into stateless logic plus 
transactions simplifies recovery; we exploit this property by 
attempting application-generic recovery for the logic, and 
intend to push it further by specializing the state stores used 
for other kinds of Internet service state, including session 
state and persistent non-relational state such as user profiles. 

The authors of [38] define a performability [35] metric 
for interactive Internet services that captures the decrease 
in throughput as they operate under fault conditions. Their 
work is an important step in applying performability to this 
domain, although as presented, it does not capture the effect 
of faulty operation on a typical end-user — how likely is it 
that a typical user request will fail or be delayed, and by 
how much? 

An important part of managing partial failures is isola- 
tion. Indeed, a significant use of VMware's software is 
fault isolation, and there is a new research focus on using 
lightweight virtual machines primarily for isolation. Recent 
efforts include Denali [45], which provides an Intel x86 
isolation kernel that omits support for some instructions 
in exchange for orders-of-magnitude lighter-weight 
operation; JanosVM [43], which allows a single logical Java 
virtual machine to be split among multiple OS processes; and 
Luna [21] and the Sun MVM [17], which improve isolation 
among "tasks" running in a single Java VM. 

BASE [40] and BFT [10] try to detect and correct what 
would otherwise be silently-wrong answers, e.g. due to data 
corruption or a malicious adversary. Their work is com- 
plementary to ours and composes with it, though we note 
that session state corruption errors such as we encountered 
would be difficult for these approaches to find as well. 

7 Future Work 

We intend to study the feasibility of using microreboots 
for software rejuvenation [25], the preemptive reboot of 
applications to stave off failure caused by software aging (e.g., 
due to resource leaks). By microrejuvenating one component 
at a time in a rolling fashion, we may be able to rejuvenate 
the entire application without any downtime or failed 
requests. Combined with recursive restarts [7], a technique 
that selectively restarts system components either reactively 
or proactively, we hope microreboots can preserve the ben- 
efits of reboots while eliminating some of their drawbacks. 

Other research has focused on developing microreboot-safe 
storage systems for session state [31] and non-relational 
persistent state such as user profiles and catalog 
data [23]. We intend to integrate these subsystems with 
our prototype to realize a complete three-tier Internet 
application in which every subsystem is microrebootable. We 
hope to then use microreboots to enforce simple, predictable 
fault models, by coercing any component failure into a fast- 
recovering crash failure. Using microreboots aggressively 
in this fashion has the potential to help with containment of 
faults and to improve predictability of component behavior 
in the face of failures. 

Finally, although in this paper we tried to identify the 
limitations of μRB-ing on unmodified J2EE Internet 
applications, our longer-term goal is the development of design 
rules, tools, and building blocks for writing componentized 
applications for which recovery is dramatically simplified 
because they are completely μRB-safe, that is, crash-only 
applications [8]. 

8 Conclusions 

We described some problems faced by Internet sites for 
which rebooting in response to failure may be appealing 
but too expensive, and proposed microreboots as a way to 
alleviate the problem. Microreboot-based recovery works 
well for componentized applications when the components 
have well-defined boundaries, when externally-visible (to 
the component) state is either discardable or persisted 
elsewhere, and when normal control flow mechanisms are 
designed to handle the case of a component being temporarily 
unavailable because it is recovering. These constraints are 
largely met by existing component architectures; we used 
a J2EE three-tier application to show that microreboots can 
improve recovery time and simplify recovery from transient 
faults, with no application-specific a priori knowledge. 

Because microreboots are predictable and lightweight, it 
becomes acceptable to make occasional mistakes regarding 
recovery, which in turn enables the future use of sensitive 
fault detection techniques for aggressive triggering of re- 
covery. This approach blurs the line between normal op- 
eration and recovery and leads toward the design of sys- 
tems that are always "recovering" as one way of adapting 
to changing conditions. We hope that this will result in one 
concrete path to "self-*" systems. 

The current release of our software is available for download 
at http://crash.stanford.edu/download; future versions 
will be posted there as well. 